Get Data

About the Data

This dataset is based on a set of Amazon reviews of video games downloaded from Prof. Julian McAuley’s website. The original JSON dataset was parsed into a data frame and cleaned. Since the full dataset is very large, I have included only data from Aug 2013 to July 2014, which gives us a year of data to work with. We also use only three fields: id, review, and review rating. Finally, three blank reviews were dropped from the data. The data we are going to use includes the following fields:

  • id: A unique identifier for each review
  • review: Text of review posted on Amazon
  • review_rating: Each review on Amazon is rated by others using a five-star scale (presumably based on helpfulness of review).

Read in Reviews

You must read in the data before you can run the code on your own machine. To do so, set your working directory and then run the following code.

videogame = read.csv('video_game_reviews.csv', stringsAsFactors = FALSE)

Explore

Structure of dataset

str(videogame)
## 'data.frame':    26652 obs. of  3 variables:
##  $ id           : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ review       : chr  "1st shipment received a book instead of the game.2nd shipment got a FAKE one. Game arrived with a wrong key ins"| __truncated__ "I still haven't figured this one out. Did everything as instructed but the game never installed. Strange. Since"| __truncated__ "I bought this and the key didn't work.  It was a gift, and the recipient wasn't able to solve the problem.  It "| __truncated__ "I love it! Use it all the time. Really works perfectly with all games that you need a mic for." ...
##  $ review_rating: int  1 2 1 5 5 1 5 5 5 5 ...

Ratings of Reviews

median(videogame$review_rating) # median review rating
## [1] 5
mean(videogame$review_rating) # mean review rating
## [1] 4.194544
# Using dplyr
library(dplyr)
videogame%>%
  summarize(average_rating = mean(review_rating), 
            median_rating = median(review_rating))

Distribution of Reviews

library(ggplot2); library(ggthemes)
ggplot(data=videogame,aes(x=review_rating))+
  geom_histogram(binwidth = 1, fill='sienna3')+ # one bar per star rating instead of the default 30 bins
  theme_bw()+
  scale_x_reverse()+
  xlab('Review Rating')+
  coord_flip()

Review 617

To get a feel for the text of the reviews, let us examine simple text characteristics of a randomly selected review, number 617.

Characters

One measure of length of a review is the number of characters.

videogame$review[617]
## [1] "Like scottrocket3 said, you're out of your damn mind if you buy this for $200.  This is the best Mario game that ever came out.  The other is Super Mario World for Super Nintendo, also available for Nintendo DS.  I can't say enough about this great game!!  I LOVED the Super Mario Brothers Super Show featuring wrestling great Captain Lou Albano who, unfortunately passed away recently.  He was a Christian, so I will see him again.  Don't know about Danny Wells who did Luigi.  This is worth every penny you spend on it.  Unless of course you spend $100+dollars on this.  Mario first got me hooked on mushrooms.  Since then, I eat them by the truckload!!  They're good for you & have vitamin D, the sunshine vitamin."
nchar(videogame$review[617])
## [1] 717

Words

Another measure of length is number of words. While counting words may seem too elementary to warrant discussion, take a moment to think of how you would tell a computer to identify a word in text.

In the illustration below, we will convey the definition of a word to the computer as a pattern. The function str_count from the library stringr will identify words based on the pattern and count them.

In this case, the pattern '\\S+' matches a run of one or more consecutive non-whitespace characters, so str_count returns the number of such runs in review 617. More about patterns below.

library(stringr)
str_count(string = videogame$review[617],pattern = '\\S+')
## [1] 127
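
To see what the pattern '\\S+' actually counts, it can help to try it on a few short strings. The sketch below uses made-up strings rather than reviews from the dataset, and a base R stand-in for str_count so that it runs without any packages. The helper name count_matches is hypothetical and is not part of stringr.

```r
# Toy strings (made up for illustration, not taken from the dataset)
toy = c('lol', 'I love it!  Use it all the time.', '')

# Hypothetical helper: number of non-overlapping regex matches per string,
# a base R stand-in for stringr::str_count
count_matches = function(string, pattern) {
  lengths(regmatches(string, gregexpr(pattern, string, perl = TRUE)))
}

word_counts = count_matches(toy, '\\S+')
word_counts
```

The counts here are 1, 8 and 0. Punctuation attached to a word, as in "it!", is counted as part of that word, repeated spaces do not create extra words, and an empty string contains no words.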

Sentences

Now, let us count the sentences. As above, it is worth pondering the definition of a sentence. The definition of a sentence is then encoded as a regular expression. Regular expressions (regex) are a framework for teaching a computer how to recognize patterns in text. Regex syntax is largely consistent across programming languages, and you can read more about it here.
We will define a sentence as “a set of characters or punctuation (comma, quote) or spaces that end with one or more period, question mark, exclamation mark or combination of them.” The following regular expression would match this definition:

str_count(string = videogame$review[617],pattern = "[A-Za-z,;'\"\\s]+[^.!?]*[.?!]")
## [1] 12
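
The behaviour of the sentence pattern is easier to judge on short inputs. The sketch below applies the same regular expression to a few made-up strings (not taken from the dataset), using base R functions as a stand-in for str_count so that no packages are needed.

```r
# Toy strings (made up for illustration, not taken from the dataset)
toy = c('I love it! Use it all the time. Really works',
        'Great game!!',
        'lol')

# Same pattern as in the text: letters, some punctuation or spaces,
# followed by anything except a terminator, then one terminator
sentence_pattern = "[A-Za-z,;'\"\\s]+[^.!?]*[.?!]"

# Base R stand-in for stringr::str_count: matches per string
sentence_counts = lengths(regmatches(toy, gregexpr(sentence_pattern, toy, perl = TRUE)))
sentence_counts
```

The counts are 2, 1 and 0. A run of terminators such as "!!" still counts as one sentence, while text with no closing period, question mark or exclamation mark ("Really works", "lol") is not counted at all, which is how a review can end up with a sentence count of zero.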

All Reviews

The number of characters, words, or sentences in review 617 is hard to evaluate out of context. So, let us get some context by examining a summary of characters, words, and sentences across all reviews.

Characters

Characters across all reviews.

summary(nchar(videogame$review))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     3.0   142.0   270.0   673.4   667.0 24012.0

Words

summary(str_count(string = videogame$review,pattern = '\\S+'))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     1.0    27.0    51.0   123.3   124.0  4467.0

Sentences

summary(str_count(string = videogame$review,pattern = "[A-Za-z,;'\"\\s]+[^.!?]*[.?!]"))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   2.000   4.000   7.575   8.000 416.000

In summary,

library(dplyr)
videogame %>%
  select(review)%>%
  mutate(characters = nchar(review),
         words = str_count(review,pattern='\\S+'),
         sentences = str_count(review,pattern="[A-Za-z,;'\"\\s]+[^.!?]*[.?!]"))%>%
  summarize_at(c('characters','words','sentences'),.funs = mean,na.rm=T)

Longest and Shortest Review

Let’s use the number of words in a review to measure the length of a review. The shortest review is:

shortest_review_index = which.min(str_count(string = videogame$review,pattern = '\\S+'))
videogame$review[shortest_review_index]
## [1] "lol"

And the longest review is:

longest_review_index = which.max(str_count(string = videogame$review,pattern = '\\S+'))
videogame$review[longest_review_index]
## [1] "You all knew this was coming, buckle in too... it's gonna be a wild ride.Bravely Default, as many of you know, is Square-Enix's latest RPG. I like to think of it as Square holding this game out in front of you going "Hey, look LOOK We can still make RPGs... no no don't look at FFXIII look at this." This is of course not true, but it's funny in my head. The game is Final Fantasy down to every last drop, the only thing it's missing is Chocobos, Mogs, and the words "Final Fantasy" in it's title. Let's go through the FF Checklist shall we?1. Party of characters that have no business having the power of gods because they are like... 17. Check2. Hair styles that require at least 1 gallon of hair gel a day? Check3. Classes and Spells that only Square uses (Fira, Blizzaga, etc...). Check4. Final Fantasy Items: Hi-Potions, X-Potions, Ethers, Turbo Ethers, Antarctic wind etc... Check5. "Multi-hits" (when you attack it says 4-hit) Check6. "Limit Breaks" of some kind? Check7. Airships? Duh! Check8. Final Fantasy jobs? Let's see... Black Mage, White mage, Red mage, Monk, Ranger, Summoner... CHECK8. Four Elemental Crystals that serve as the centerpiece for the world making it run in harmony until one day something shadowy and/or evil happens to them that requires a party of unlikely heroes to band together and make the crystals shiny again... CHEEEEEEEEEECK.Ok now that we've established that this is Final Fantasy V-2 (May as well be). ON WITH THE REVIEW.I'm going to say this right out of the gate. If you have any desire and I mean ANY slight liking of turn based JRPGs, and you have a 3DS. Get this game. Now. If you don't have a 3DS, go buy one and get this game. The amount of things in this game is astonishing and it's just a blast to play, while at the same time it feels like a mixture of modern and older mechanics... weighted more to the older side. I'm going to section out this review so if you are interested in the story or combat you can skip to those sections. 
Not what I USUALLY do but I'm feeling different today. I have not finished the game but I am at the end of chapter 2. That may not seem like far but that's like... 20 hours already. I've had enough time to get a grasp on the games existence!StoryI can't rate a story until I've finished it, and even then it's not really a ratable thing. However, I can say that it's at least so far interesting. The following is not a spoiler, unless you count what you learn in the first like... 20 minutes a spoiler. The world is held together by 4 Elemental Crystals, Water, Fire, Wind, and earth. Each crystal has a vestal that keep the crystal all shiny and happy through prayer (It's the games religion. Crystalites or something I believe it was called). One of your 4 party members is the Vestal of wind, Agnès (pronounced on-yes... not kidding). In the opening video we see bad things happening, some sort of shadow has taken over the crystals. Gee... haven't heard THAT one before... Meanwhile Tiz's entire village is reduced to a giant hole. The goal, as far as I can tell, is to go around the world and restore the 4 crystals. *Ahem* Totally not FF1 or FFV.While you are on this task, another major power called the Bloodrose Legion (I believe) is trying to stop you, since they are anti-crystals and think that the world should be better off without them, yet the second the crystals stopped working the world... more or less died so I'm not sure what logic they are using. Either way the story does what it needs to do, it provides a way to get from Point A to Point B and it actually does something I approve of, it doesn't get in the bloody way. Not once since this started has there been a GIANT gap of story, and by that I mean you finish a task and you spend the next 2 hours listening to people talking about this and that. The talking scenes are quick, they get to the point for the most part, and don't get boring. 
Just goes to show Square can still tell a story without it being 8 feature length films. With some gameplay in between.CharactersAgain, can't really rate this one. The four main characters all have a distinct personality.Tiz, the cliche hero... I mean Shy Farmer, witnessed a terrible cliche... I mean his village being destroyed, and joined up with the Vestal of wind in the hopes of restoring his village. He is a quite type that is willing to protect his friends yadda yadda yadda. Walking. Talking. Cliche.Agnès, or as I like to say dear god please stop talking now my EARS ARE BLEEDING (more on that later). She is the Vestal of Wind and has been tasked with fixing the worlds problems, despite everyone hating her in nearly every town I seem to go to. Then they like her once Agnès and the gang fixes something. Silly non-believers. She is innocent, having spent her entire life away from the common-folk and doesn't know a darn thing about what goes on in the world.Ringabel, the games other male character. He has, and stop me if you have heard this one before, Amnesia! he can't remember anything, but he does have a journal that seems to be detail everything that happens before it happens and serves as a plot device constantly. He is also a self-proclaimed ladies man and hits on pretty much every female in the area around him. He has some great lines though even if most of the time you want to kick him in the back of the head.Edea, Actually I saved the best for last as she is by far my favorite character in this game. She betrayed the army she was with to protect Agnès and has probably got some of the best bits of dialog in the game, particularly when talking to Ringabel who has fallen in love with Edea.So there you have it, 2 walking Cliches and 2 characters that have a bit of spirit to them. They work well and have a nice synergy about them. I enjoy their interactions from Agnès's cluelessness to Ringabels... 
idiocy and Eda calling Ringabel out on it.By far my favorite line in the game thus far has been:Ringabel: "I know, I've been beating my head against the wall trying to figure it out."Edea: "Oh? Perhaps you aren't hitting the wall hard enough, here let me assist you"Dunno why, but it makes me laugh!Voice Acting - 2/10OK... so I've described the main characters and they are fine and all but I'm going to touch on what I believe to be the most painful part of this game... it's voice acting. I don't say this lightly, but it is among the worst voice acting I have ever heard in a video game to date. There are worse examples in other games (Maro maro from Blue Dragon for example) but, on the whole this entire game's VAing is bad. I even rate Star Ocean 4's voice acting above this and THAT is saying something.Agnès has got by far the worst voice, it's grating, annoying, breaks constantly and just is all around unpleasant. Couple that with the fact that the sound quality of the game is so compressed it's just sad and you have an unpleasent experience. Ringabel and Edea actually have decent voices at least, they have emotion, and sound higher quality for some reason. Those Actors were trying. Some of the enemies also have decent voices, while others have ungodly terrible voices. I switched it to Japanese to see the difference and it just created a different world of issues, some voices got better while others got worse but at the same time the quality was still bad so I switched it back to English.I almost want to recommend playing the game with the voices off (you can mute them specifically) but at the same time I feel that would take a bit away from the game, because at least Ringabel and Edea have fun things to say and they say it well, so... worth it. If I could mute Agnès I would. Without question. I know they had to compress it because of the 3DS but... come on... 
you may as well not recorded half of it and saves some space there.Customization 10/10 - This is split into 2 seperate systemsJobsThrowing back to the FFV days, Square decided to reintroduce the Job system... and I could NOT be happier with this. The game calls it the Asterisk system but... I find that ridiculous so I'm going to call it the job system since they are actually called Jobs in the game anyway... I think they are just confused.The game has something like 23 or 24 different jobs that each character can be. As you defeat enemies you get both EXP and JP (Job Points). Jobs have levels and you advance them by gaining these job points. Each level will unlock either a new job skill or a new job ability. For Example a black mage unlocks tiers of magic as their skills. Level 1, 2, 3 etc... black magic. But that's not all they get, they also obtain abilities like Immunity to silence, Reduced damage from Lightning, etc... these abilities and skills can then be used for other jobs. So if I switch the Black mage to a white mage, I can set their secondary skills to be black magic. Now I can cast both White and Black magic as my White mage. The abilities are also interchangeable, by the end of the game you get 5 ability slots, each ability takes X number of slots (from 1 - 3 I believe) and each does something different. For instance monks gain +10% HP for 1 point, +20% Hp for 2 points and +30% HP for 3 points, combine that with the knights  + 20% physical Defense and you can make an impressive tank character. Or combine 2 Handed (Wield a weapon with 2 hands for double attack power) and +20% Physical Attack on a spell fencer for some massive elemental Damage.The combinations are vast and for me have been really fun to mix and match. I'm up against a boss right now that I can't beat just yet because of one attack it uses that deals massive water damage, so I'm working on unlocking abate water (50% reduced damage from water sources) so I can withstand that assault. 
The jobs, leveling them, unlocking skills and abilities keeps me interested far more than I thought in this game, and the only wish I had was that I had more characters in my group to try more combinations!You actually don't unlock the jobs through the main story line either, at least not all of them. So far apart from the first few I got as part of the tutorial, all of my jobs have been unlocked through side quests... and these aren't simple tiny side quests they are actual quests. There is trouble in town and you have to figure out what it is! I won't spoil the story but the first 2 side quests I took unlocked 4 more jobs for me to use. So getting all the jobs in the game is going to be fun. At this time I have 11 unlocked I believe. Ranger being my favorite, because it does MASSIVE damage and knows the weakpoints of most of my enemies... and I can use those skills on my other jobs so imagine having all those 1.5x damage skill, then switching to a spell fencer, enchanting your sword with fire and the 2 handed ability, then attacking a plant monster who is weak to fire with the ability that deals 50% more damage to plants... ooooh it's going to be a dead plant!SpecialsSo this game has something similar to limit breaks from FFVII, they are called Specials. You unlock them as the game progresses (more on this later) but you start with a couple in the beginning. Each weapon type has a few specials attached to it and various ways with which to trigger it. The Rod for example requires you to deal damage with magic spells 10 times, and it's special is a single target laser beam that deals heavy damage to 1 opponent... but that's not all. Each special has a secondary buff that is granted to your party for a few turns. So the rod's special also grants +20% additional Magic Damage for a few turns to the whole party. For another example, the Bow requires you to deal weak point damage to enemies 10 times (Rangers are good at that). 
The attack does massive damage to 1 enemy and raises your parties crit rate by 400% for a few turns... while also looking awesome. There are also support specials like Rejuvenation that heals the party etc... or buff the party and keep everyone protected.Now you may ask why this is under the customization section, well that's because the specials are customizable! As you progress through rebuilding the village (More later on this) you unlock abilities that you can equip to your specials. For instance you can make your special Water based, or Fire based, make it do more damage to plants or Dragons, have a higher crit rate etc... This can be a good and bad thing because it does require a bit of planning. I went into a boss fight with a special that actually healed the boss because it was water based and I forgot to switch it for lightning. But I think it's still awesome how we can change them, make them a little more ours... and not just with abilities. You can rename them! I renamed everything to try and include badger in the title. So Hack and Slash became Badger Strike. Rejuvenation became Badger's Light etc...You can also customize what your characters say when they use the specials. Slightly limited to like... 15 characters but still it's a nice touch. I know it's a nitpick but I would like to have my character say the line while attacking not afterwards. So As I'm shooting my arrow I want him to scream "EULALIA!" (as i have told him to do). not AFTER I have shot it.Combat - 8/10Now being me, I am rather biased against Turn based combat systems, They are either boring, or just insanely brutal against you to the point of tears (Hello Persona...) however I'm going to pick that bias up and throw it out the window for this because it deserves my full unbiased viewing.The combat is great, it's fun and more complicated than simply pressing a single button untill things die. This game isn't a sit back and watch the program type of combat. You have to think. 
Admittedly part of winning is the setup before combat which I think is a good thing. Think a bit before you go into a fight and your fight should be much faster. bravely Default However, adds a twist. The Brave/Default system really spices things up in my opinion. The system is simple, yet complicated. In combat you have three choices, Attack, Brave, or Default. Attack meaning you take your turn, attack or use a skill/spell expending 1 BP (or more if a skill requires it) and that's about it. As long as you don't have negative BP you can act, you regenerate 1 BP a turn normally up to maximum of 3. Default allows you to defend, taking less damage and not acting this turn, in effect restoring 1 BP. Now these two options are fairly normal... then we get to the third option, and it's one I think makes the game stand out above others. The Brave Option.The Brave option allows you to act more than once when it's your turn. You can use it up to 3 times in 1 turn (Using 4 BP to do so). This allows you to use 4 attacks in 1 go, however it can leave you with a massive amount of Negative BP, which means you can't attack or defend until you don't have negative BP.The system may seem risky and it is, but it has purposes which Are brought forth by the job system. For instance one of the first skills you get as a monk is Invigorate, which increases your attack power by 25% for 2 turns (It has a chance to fail but lets for the sake of argument say it didn't!). Now in most instances you could only get 2 attacks off with that bonus attack, but if you use Brave to get 4 attacks in 1 turn off, you get to hit for 25% power 3 more times than usual. Of course after which you are usually exposed for a turn or 2, but who cares! Another good example is that say you just hit by a massive painful attack, 2 of your characters have died, and your white mage needs to get them up. Well use all 4 braves and you can cast raise twice, then cura twice. 
And if your like me your white mage was Defaulting for a few turns so he won't be left defenseless after a turn like that. It's just a lot of fun.Now, there are more risks than Just the obvious You can't act for a bit. The enemy can do this as well, they can use Brave and Default. More than once I've fought a boss that used Default the exact turn I decided to throw everything I had at him, and I ended up losing because He withstood my assault and destroyed my soul. It also makes the game a little less easy to predict, and adds a bit of fun to it. You never really know what the enemy is going to do! They are also limited by BP so don't expect them to attack 9 times or something. You can at all times see what the bosses BP is so you can guess on if you are about to be slammed by death or not.Now it's not perfect, and I didn't know where to take points off in this section or another but I chose this one. It being a JRPG does mean there is some grinding involved. And everyone who has played this type knows what I mean. You run around in circles on the map waiting for random encounters, you find a strategy that kills them in one turn and you keep doing that over and over and over again, Sometimes that Strat is a bit harder to find but once found it works endlessly. So I am not a big fan of that part, I can handle it, and it is still fun getting to that next job level. The game does progression amazingly well, you really do feel like you are gaining power with each level increase.The other point I take off is because of the fact that sometimes I feel like the AI is cheating. This may be me being paranoid, but rarely does the computer default until RIGHT after I decide to use everything I have at once, it's almost like clockwork. I think it waits and then acts after I've made a decision. It's annoying! but... alas it's probably my faut for trying to do everything in one turn. One other thing I will say is that the bosses are rough. Really rough. 
They require planning and have some incredibly cheap and devastating attacks.Network Features and Mini Things - 7/10Here we get into some of the slight issues I have with the game. This being the year 2013 (I know it's 2014 but the game was release originally in 2013!) We have decided that everything has to involve friends and friends lists and the internet and stuff. A theory that I sadly don't prescribe to. However, I can't fault it for trying either. There are just some key issues I think don't work that well and they inhibit the game somewhat. First I'll explain what it involves though.You can Summon friends to help you in combat, as well as be summoned by others. Now I don't mean a co-op or anything it's still a single player game. What I do mean is that if you have an internet connection you can register an attack or a heal and send it to your buddies to use as an attack in their game. For example I cast Rejuvenate on my entire Party and send it to the internet. Anyone on my friend's list can now summon my character and cast rejuvenate on their party (only once though). The more friends you have on your friends list the more options you can have to summon etc... Now there is an inherent issue with this. This being a Nintendo game, it's STILL annoying as hell to register people onto your friends list. You have to exchange codes and all that fun crap still, and without a forum or a central place where you are giving people codes you will not get them. You also have to have a wireless internet connection which yes at this point is fairly common but not everyone has it. The good thing is, Summoning and sending doesn't really add much to the game other than a bit of fun. The other feature however is more reliant on it.There is a mini-game in this game, and it's the restoration effort for Tiz's village. You can restore Shops, Get new specials, get special modifiers as well as weapons and armor you can't buy elsewhere in the game. 
This is a time consuming game that thankfully performs all it's tasks in the background or while your 3DS is in sleep mode. As I write this review, my villagers are at home building my shops up, and each one will take about 6 hours to do so. I left some major projects going while I went to work. Now you start the game with 1 villager, and he can build up things slowly. If you connect to the internet 1 time a day you slowly gain more and more villagers, this really helps in the restoration effort. But it also highlights the biggest problem I have. What if you don't have that option. If you can't connect to the internet, or you can't find anyone with spot pass... you are stuck with 1 villager and the amount of time it would take for him to do everything is months. And in that line, you would miss out on a lot of specials and cool items. It is easily remedied by going to a free Wi-Fi place which... heck I think even McDonald's has free wi-fi now, but still it's a tad off.Then there is the one thing I think wasn't needed in this game, it actually has a Micro transaction shop in it. What can you buy? Gear? Armor? Items? Noooo... you can buy Turns to use in combat. That's right.. you can Buy the ability to use an additional turn. I don't know the cost, because the first one is discounted at 50 cents and I didn't buy it. The other way to gain additional SP (That's the currency for extra turns) is to put the 3DS into sleep mode for 8 hours with the game running it's tasks. You can have up to 3 and can use them in combat whenever you want by pressing the start button. Honestly though. I don't really see the need for buying them. I can't say it's like the devil that it's in the game, but I have to question why it's there in the first place. It seems... just... odd to me.Which also brings up my final thought on the mini-game... you do have to keep your 3DS in sleep mode CONSTANTLY. 
Right now mine is plugged into the wall at home, and It's going to be on for a very very long time considering how much of the village I have to restore. I mean 5 villagers working on clearing one step of the village will take 20 hours. (99 alone!). But it is always a joy to come home or wake up after letting them work for 8 hours and getting all those rewards. No you can't pay to get them to work faster!You also have the ability to customize a few bits of difficulty in the game. Allowing you to change the frequency of encounters from 0 (none) to +100% (A lot more). This allows for better grinding, or getting back to town if you really didn't wanna fight things 20 levels lower than you! You can also enable or disable JP, EXP, Autosave, and change the difficulty from Normal to Hard or Easy. Hard mode... In my opinion is just frustrating and gives no reward for you actually playing in hard mode. At least none that I saw. It makes every fight brutal and basic enemies can one shot you, I didn't honestly enjoy it all that much, so I switched back to normal. And even then bosses destroy me. Seriously though I could not play this game without Autosave, I forget to save too often and would be annoyed with losing time!All in all, I give the game 9 Bitter Badgers out of 10. It's fantastic, and I can't seem to stop playing the damn thing. If you liked Final Fantasy Anything before XII, you will probably love this game. It has customization, craziness, and joy around every corner. The combat is fun yet challenging, the bosses may be a bit brutal, bit nothing a little grinding or better planning won't fix!TLDR Version: 9/10.Pros+Good risk vs. Reward combat+Decent Characters with lots of cliches thrown in+Nearly unparalleled amount of Customization for the type of game it is. 
From jobs to abilities to your special attacks.+Decent Story, doesn't hold the game up and has some fun dialog+Challenging if you want it to be.+Fairly easy if you want it to be, there is even an auto battle feature+It's final Fantasy V-2. No other way to put it!Cons-Voice acting is atrocious-Sometimes the AI can feel cheap-The internet features can be limiting if youo don't have the option to connect to them.-Did I mention the voice acting is BAD?-If you went digital it takes up basically an entire 4gb SD card.-Gotta leave your 3DS in sleep mode for the village to be rebuilt in any logical amoutn of time even with a good amount of villagers."

How many words are in the shortest review?

str_count(videogame$review[shortest_review_index],pattern = '\\S+')
## [1] 1

Correlation between Length and Rating

The variation in the length of reviews is striking. This raises the question: are longer reviews seen as being more helpful? To answer this question, let us examine the correlation between length and review_rating.

r_characters = cor.test(nchar(videogame$review),videogame$review_rating)
r_words = cor.test(str_count(string = videogame$review,pattern = '\\S+'),
                             videogame$review_rating)
r_sentences = cor.test(str_count(string = videogame$review,pattern = "[A-Za-z,;'\"\\s]+[^.!?]*[.?!]"),
                                 videogame$review_rating)

correlations = data.frame(r = c(r_characters$estimate, r_words$estimate, r_sentences$estimate),
                          p_value=c(r_characters$p.value, r_words$p.value, r_sentences$p.value))

rownames(correlations) = c('Characters','Words','Sentences')
correlations
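
The code above relies on the object returned by cor.test, which stores the correlation coefficient in its estimate element and the p-value in its p.value element. The following minimal sketch shows that structure on made-up numbers. The vectors toy_length and toy_rating are assumptions for illustration only and say nothing about the real reviews.

```r
# Made-up lengths and ratings, for illustration only
toy_length = c(10, 20, 30, 40, 50, 60)
toy_rating = c(5, 4, 5, 3, 4, 2)

r_toy = cor.test(toy_length, toy_rating)  # Pearson correlation by default

# estimate is a named numeric (name 'cor'); unname() drops the name
toy_result = data.frame(r = unname(r_toy$estimate),
                        p_value = r_toy$p.value)
rownames(toy_result) = 'Toy length'
toy_result
```

Pearson correlation treats the 1 to 5 rating as a numeric scale. If one prefers to treat the rating as ordinal, cor.test also accepts method = 'spearman' as a rank-based alternative, though that is a different choice from the one made in this tutorial.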

Grammar

Meaning may come not only from words but also from grammatical elements such as punctuation and case. One may react differently to text that is in upper case compared to sentence case. Use of upper case may be seen as an expression of anger. Similarly, a set of consecutive exclamation marks may convey surprise or emphasis. Let us examine the prevalence of upper case letters and exclamation marks in the reviews.

Screaming Reviews

How do you feel about people who scream in texts or email? What about screaming in reviews?
Here we examine the percentage of characters in each review that are upper case letters.

percentUpper = 100*str_count(videogame$review,pattern='[A-Z]')/nchar(videogame$review)
summary(percentUpper)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   1.689   2.520   3.365   3.622 100.000
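
As a quick check on what this percentage measures, the sketch below repeats the calculation on three made-up strings (not from the dataset), using base R in place of str_count so that it runs without packages.

```r
# Toy strings (made up for illustration, not taken from the dataset)
toy = c('LOL', 'I love it!', 'ok')

# Number of upper case ASCII letters in each string
upper_count = lengths(regmatches(toy, gregexpr('[A-Z]', toy, perl = TRUE)))

# Percentage of all characters (including spaces and punctuation)
# that are upper case letters
percent_upper_toy = 100 * upper_count / nchar(toy)
percent_upper_toy
```

The results are 100, 10 and 0. A review written entirely in capitals, like "LOL", reaches the maximum of 100 seen in the summary above. Because the denominator is the total number of characters, an empty string would give 0/0, which R reports as NaN. This is one reason it matters that blank reviews were dropped from the data.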

Exclamation Marks

In messaging, reviews and chat, we are free to break the rules of grammar, for example by using a flood of exclamation marks. Does that tell us something?
First, how common are exclamations?

percentExclamation = 100*str_count(videogame$review,pattern='!')/nchar(videogame$review)
summary(percentExclamation)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   0.000   0.194   0.000  72.727

Are reviews with exclamations rated as being more helpful?

r_upper = cor.test(percentUpper,videogame$review_rating)
r_exclamation = cor.test(percentExclamation,videogame$review_rating)

correlations2 = data.frame(r = c(r_upper$estimate, r_exclamation$estimate),
                           p_value=c(r_upper$p.value, r_exclamation$p.value))

rownames(correlations2) = c('Upper Case','Exclamation Marks')
correlations2

Keywords

Beyond meta-data on reviews, one may be interested in specific keywords in reviews. Suppose we are curious about what percentage of reviews mention ‘minecraft’. We will use str_detect to look for Minecraft as a single word or as two words, independent of case. As you look at the results, bear in mind that these reviews are from 2013-14, only a few years after the game's release.

library(stringr)
mean(str_detect(string=tolower(videogame$review),pattern='minecraft|mine craft'))*100
## [1] 0.1913552
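
The same idea can be sketched on a few hypothetical snippets using base R's grepl. Here the optional space and the case-insensitivity are expressed in the pattern and its arguments rather than through tolower; str_detect with regex('mine ?craft', ignore_case = TRUE) should behave the same way.

```r
# Hypothetical review snippets, for illustration only
x = c('My kids love Minecraft', 'Mine Craft is fun', 'A solid racing game')

# 'mine ?craft' matches with or without a space; ignore.case handles capitals
mentions = grepl('mine ?craft', x, ignore.case = TRUE)
mentions

# Taking the mean of a logical vector gives the share of TRUE values
100 * mean(mentions)
```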

Tokenize

One approach to analyzing text is to treat it as a “bag of words” where words are examined independent of their position in text. To implement this approach, text is decomposed into tokens (tidytext::unnest_tokens). The default token is words, but it is possible to specify other tokens such as ngrams, characters, and sentences. The words can be summarized and organized using dplyr functions. Later on, we will also use the tidyr package to reshape the data.
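
To see what different token types look like, here is a minimal sketch on a hypothetical two-row data frame (toy is made up and stands in for the reviews). It contrasts the default word tokens with bigrams, requested through the token and n arguments of unnest_tokens.

```r
library(dplyr); library(tidytext)

# Hypothetical two-row data frame standing in for the reviews
toy = data.frame(id = 1:2,
                 review = c('The game is fun', 'Great graphics'),
                 stringsAsFactors = FALSE)

# Default token: one lowercased word per row
toy %>%
  unnest_tokens(output = word, input = review)

# Bigrams: pairs of adjacent words, formed within each review
bigrams = toy %>%
  unnest_tokens(output = bigram, input = review, token = 'ngrams', n = 2)
bigrams
```

Bigrams never span two reviews, since each row is tokenized separately.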

Words per Review

Let us begin by counting the number of words in each review.

library(dplyr); library(tidytext)
videogame%>%
  select(id,review)%>%
  unnest_tokens(output = word,input=review)%>%
  group_by(id)%>%
  summarize(count = n())%>% 
  ungroup()

Total Words

So, what is the total number of words in all reviews?

videogame%>%
  select(id,review)%>%
  unnest_tokens(output = word,input=review)%>%
  count()  # count is a shortcut to summarize used above

Most Common Words

Let us see which words are used most frequently in these reviews. To do this, we will employ the tidytext library which uses a tidy data approach. Here are the top 25 words.

library(tidytext)
videogame%>%
  unnest_tokens(input = review, output = word)%>%
  select(word)%>%
  group_by(word)%>%
  summarize(count = n())%>%
  ungroup()%>%
  arrange(desc(count))%>%
  top_n(25)
videogame%>%
  unnest_tokens(input = review, output = word)%>%
  select(word)%>%
  group_by(word)%>%
  summarize(count = n())%>%
  ungroup()%>%
  arrange(desc(count))%>%
  top_n(25)%>%
  ggplot(aes(x=reorder(word,count), y=count, fill=count))+
    geom_col()+
    xlab('words')+
    coord_flip()

Not surprisingly the list contains a lot of prepositions and articles and words that don’t convey much meaning. Such words are called stopwords. Let us look at the top 25 list after removing the stopwords.

In the code below, this is accomplished through an anti-join with a list of stop words, tidytext::stop_words.

videogame%>%
  unnest_tokens(input = review, output = word)%>%
  select(word)%>%
  anti_join(stop_words)%>%
  group_by(word)%>%
  summarize(count = n())%>%
  ungroup()%>%
  arrange(desc(count))%>%
  top_n(25)
videogame%>%
  unnest_tokens(input = review, output = word)%>%
  select(word)%>%
  anti_join(stop_words)%>%
  group_by(word)%>%
  summarize(count = n())%>%
  ungroup()%>%
  arrange(desc(count))%>%
  top_n(25)%>%
  ggplot(aes(x=reorder(word,count), y=count, fill=count))+
    geom_col()+
    xlab('words')+
    coord_flip()
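
The anti-join step can be seen in isolation on a handful of hypothetical tokens. In this sketch, tokens and my_stops are made up, and my_stops is a tiny stand-in for tidytext::stop_words, which is far larger. Naming the join column with by also avoids the "Joining, by" message.

```r
library(dplyr)

# Hypothetical tokens and a tiny hand-made stop list
tokens = data.frame(word = c('the', 'game', 'is', 'fun', 'the', 'game'),
                    stringsAsFactors = FALSE)
my_stops = data.frame(word = c('the', 'is'), stringsAsFactors = FALSE)

word_counts = tokens %>%
  anti_join(my_stops, by = 'word') %>%   # keep only rows with no match in my_stops
  count(word, sort = TRUE)               # tally each word, most frequent first
word_counts
```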

The top 25 words can also be generated using freq_terms from library(qdap). However, we are not going to use it, because installing qdap requires updating Java, which has in the past posed a problem for some Mac users. In case you are interested, here is the code to generate the top 25 words.

library(qdap)
freq_terms(text.var = videogame$review,top = 25)
plot(freq_terms(text.var = videogame$review,top = 25))

And here is a list of the top 25 words after removing stopwords based on the Top200Words list (rather than tidytext::stop_words).

freq_terms(text.var=videogame$review,top=25,stopwords = Top200Words)
plot(freq_terms(text.var=videogame$review,top=25,stopwords = Top200Words))

Categorize

One of the simplest approaches to natural language processing is to categorize words based on their meaning. Words may be categorized based on their valence (positive or negative), emotion (e.g., happy, sad), or domain (e.g., finance), among other things. This can be done conveniently using a relevant lexicon. There are a wide variety of lexicons that can be used, depending on the goal of the analysis. We will illustrate the process of categorizing tokens using a set of lexicons.

Binary Sentiment (positive/negative) Lexicons

These lexicons use a simple approach of classifying tokens into two categories, usually positive or negative. One common lexicon for doing so is ‘bing’.

Bing Lexicon

The “bing” lexicon categorizes words as being positive or negative. The lexicon is included with the tidytext library and can be accessed by calling get_sentiments('bing'). Here are the first fifty words.

as.data.frame(get_sentiments('bing'))[1:50,]
get_sentiments('bing')%>%
  group_by(sentiment)%>%
  count()%>% 
  ungroup()

Valence of Words

Now, let us explore valence of the words used in reviews using the bing dictionary. We will match the words in the dictionary with the ones in the reviews to determine valence.

videogame%>%
  unnest_tokens(output = word, input = review)%>%
  inner_join(get_sentiments('bing'))

Valence in Reviews

Let’s find out the total number of positive and negative words in the reviews.

videogame%>%
  unnest_tokens(output = word, input = review)%>%
  inner_join(get_sentiments('bing'))%>%
  group_by(sentiment)%>%
  count()%>% 
  ungroup()
library(ggplot2)
library(ggthemes)

videogame%>%
  unnest_tokens(output = word, input = review)%>%
  inner_join(get_sentiments('bing'))%>%
  group_by(sentiment)%>%
  count()%>%
  ungroup()%>% 
  ggplot(aes(x=sentiment,y=n,fill=sentiment))+geom_col()+theme_economist()+guides(fill='none')+
  coord_flip()

Positive Words

Proportion of Positive words

Next, let us find out the proportion of words in reviews that are positive. This is the ratio of number of positive words to sum of positive and negative words.

videogame %>%
  select(id,review)%>%
  unnest_tokens(output=word,input=review)%>%
  inner_join(get_sentiments('bing'))%>%
  group_by(sentiment)%>%
  summarize(n = n())%>%
  mutate(proportion = n/sum(n))%>% 
  ungroup()

Positive Words, helpful?

Let us drill down a bit more to see whether the proportion of positive words in a review has any bearing on its helpfulness. We will look at the proportion of positive (and negative) words for each rating.

videogame %>%
  select(id,review,review_rating)%>%
  unnest_tokens(output=word,input=review)%>%
  inner_join(get_sentiments('bing'))%>%
  group_by(review_rating,sentiment)%>%
  summarize(n = n())%>%
  mutate(proportion = n/sum(n))%>% 
  ungroup()

And in pictures.

library(ggthemes)
videogame %>%
  select(id,review,review_rating)%>%
  unnest_tokens(output=word,input=review)%>%
  inner_join(get_sentiments('bing'))%>%
  group_by(review_rating,sentiment)%>%
  summarize(n = n())%>%
  mutate(proportion = n/sum(n))%>%
  ungroup()%>% 
  ggplot(aes(x=review_rating,y=proportion,fill=sentiment))+geom_col()+theme_economist()+coord_flip()

Positive Reviews

Proportion of Positive Words per Review

Let us compute the proportion of positive words for each review. The proportion of positive words is the ratio of positive words to the sum of positive and negative words. This differs from the analysis above as it is computed for each review.

videogame%>%
  unnest_tokens(output = word, input = review)%>%
  inner_join(get_sentiments('bing'))%>%
  group_by(id)%>%
  summarize(positive_words = sum(sentiment=='positive'),
            negative_words = sum(sentiment=='negative'),
            proportion_positive = positive_words/(positive_words+negative_words))%>% 
  ungroup()
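
A small worked example may help in reading this table. The sketch below uses hypothetical tokens and a made-up two-category lexicon (toy_lexicon) in place of bing. It also shows a side effect of inner_join: a review with no lexicon words disappears from the result entirely.

```r
library(dplyr)

# Hypothetical tokens (one row per word) and a tiny two-category lexicon
tokens = data.frame(id = c(1, 1, 1, 2, 2, 3),
                    word = c('love', 'great', 'broken', 'awful', 'fun', 'arrived'),
                    stringsAsFactors = FALSE)
toy_lexicon = data.frame(
  word = c('love', 'great', 'fun', 'broken', 'awful'),
  sentiment = c('positive', 'positive', 'positive', 'negative', 'negative'),
  stringsAsFactors = FALSE)

per_review = tokens %>%
  inner_join(toy_lexicon, by = 'word') %>%   # review 3 has no lexicon words and drops out
  group_by(id) %>%
  summarize(positive_words = sum(sentiment == 'positive'),
            negative_words = sum(sentiment == 'negative'),
            proportion_positive = positive_words / (positive_words + negative_words))
per_review
```

Review 1 has two positive words and one negative word, giving 2/3; review 2 has one of each, giving 0.5.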

Correlation between Positive Words and Review Rating

Let us see if reviews with a lot of positive words are rated favorably.

videogame%>%
  unnest_tokens(output = word, input = review)%>%
  inner_join(get_sentiments('bing'))%>%
  group_by(id,review_rating)%>%
  summarize(positive_words = sum(sentiment=='positive'),
            negative_words = sum(sentiment=='negative'),
            proportion_positive = positive_words/(positive_words+negative_words))%>%
  ungroup()%>%
  summarize(correlation = cor(proportion_positive,review_rating))

NRC Sentiment Polarity Lexicon

There are a couple of nrc lexicons. This one is a binary sentiment lexicon that categorizes words as +1 or -1.

library(lexicon)
head(hash_sentiment_nrc)
hash_sentiment_nrc %>%
  group_by(y)%>%
  summarize(count= n())%>%
  ungroup()

Since lexicons vary by number of words covered and manner of classification, results will vary by lexicon.

videogame %>%
  select(id, review)%>%
  unnest_tokens(output = word, input = review)%>%
  inner_join(y = hash_sentiment_nrc,by = c('word'='x'))%>%
  group_by(y)%>%
  summarize(count = n())%>%
  ungroup()

Emotion Lexicon

This family of lexicons categorizes words based on the emotion conveyed. Rather than grouping words as positive or negative, words are categorized based on the emotion reflected. Furthermore, a single word may reflect multiple emotions.

NRC Emotion Lexicon

A word may reflect more than just valence. The ‘nrc’ lexicon categorizes words by emotion. This lexicon, which was previously a part of library(tidytext), was dropped from the package as of June 14, 2019 and re-integrated later in 2019. To access nrc from tidytext, run get_sentiments('nrc') in interactive mode (i.e., console or script, not markdown) and agree to the non-commercial use agreement. This should only be required on first use. An alternative is to use the copy of the lexicon that I have taken from its non-commercial use link and posted to github. This lexicon and a number of others can be found here but its free use is limited to non-commercial purposes. The following code will place the lexicon in a dataframe called nrc.

Method 1: Get Lexicon from tidytext

nrc = get_sentiments('nrc')

Method 2: Get Lexicon posted on github

nrc = read.table(
                 file ='https://raw.githubusercontent.com/pseudorational/data/master/nrc_lexicon.txt',
                 header =F,
                 col.names =c('word','sentiment','num'),
                 sep ='\t',
                 stringsAsFactors =F)
nrc = nrc[nrc$num!=0,]
nrc$num = NULL

Here is a list of emotions covered by this lexicon.

nrc%>%
  group_by(sentiment)%>%
  count()%>%
  ungroup()
table(nrc$sentiment)  
## 
##        anger anticipation      disgust         fear          joy     negative 
##         1247          839         1058         1476          689         3324 
##     positive      sadness     surprise        trust 
##         2312         1191          534         1231

Emotions in Reviews

Let us examine the emotions expressed in the reviews.

videogame%>%
  unnest_tokens(output = word, input = review)%>%
  inner_join(nrc)%>%
  group_by(sentiment)%>%
  count()%>%
  arrange(desc(n))%>%
  ungroup()
videogame%>%
  unnest_tokens(output = word, input = review)%>%
  inner_join(nrc)%>%
  group_by(sentiment)%>%
  count()%>%
  ungroup()%>% 
  ggplot(aes(x=reorder(sentiment,X = n),y=n,fill=sentiment))+
         geom_col()+guides(fill='none')+coord_flip()+theme_wsj()

Emotion in Each Review

Here, we will examine each review and the frequency of each emotion expressed.

library(tidyr)
videogame%>%
  unnest_tokens(output = word, input = review)%>%
  inner_join(nrc)%>%
  group_by(id,sentiment)%>%
  count()%>%
  pivot_wider(names_from = sentiment,values_from=n)%>%
  select(id, positive, negative, trust, anticipation, 
         joy, fear, anger, sadness, surprise, disgust)%>%
  mutate_at(.vars = 2:11, .funs = function(x) replace_na(x,0))%>% 
  ungroup()
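
The reshape-then-fill step can also be written more compactly. The sketch below is an alternative, not the approach used above: it assumes tidyr 1.1.0 or later, where pivot_wider accepts a scalar values_fill, and it runs on hypothetical counts rather than the review data.

```r
library(dplyr); library(tidyr)

# Hypothetical per-review emotion counts in long format
counts = data.frame(id = c(1, 1, 2),
                    sentiment = c('joy', 'anger', 'joy'),
                    n = c(3L, 1L, 2L),
                    stringsAsFactors = FALSE)

# values_fill supplies the value for id/sentiment pairs that never occur
wide = counts %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0)
wide
```

Review 2 has no 'anger' row in the long data, so its anger cell is filled with 0 instead of NA.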

Emotions and Rating

One may suspect that review rating will be tied to the emotion expressed. Let us explore this possibility.

videogame%>%
  unnest_tokens(output = word, input = review)%>%
  inner_join(nrc)%>%
  group_by(id,sentiment,review_rating)%>%
  count()%>%
  pivot_wider(names_from = sentiment,values_from = n)%>%
  mutate_at(.vars = 3:12, .funs = function(x) replace_na(x,0))%>%
  pivot_longer(cols = 3:12, names_to = 'sentiment',values_to = 'n')%>%
  ungroup()%>%
  group_by(sentiment, review_rating)%>%
  summarize(n = mean(n))%>% 
  ungroup()
videogame%>%
  unnest_tokens(output = word, input = review)%>%
  inner_join(nrc)%>%
  group_by(id,sentiment,review_rating)%>%
  count()%>%
  pivot_wider(names_from = sentiment,values_from = n)%>%
  mutate_at(.vars = 3:12, .funs = function(x) replace_na(x,0))%>%
  pivot_longer(cols = 3:12, names_to = 'sentiment',values_to = 'n')%>%
  ungroup()%>%
  group_by(sentiment, review_rating)%>%
  summarize(n = mean(n)) %>% 
  ungroup()%>% 
  ggplot(aes(x=review_rating,y=n,fill=review_rating))+
  geom_col()+
  facet_wrap(~sentiment)+
  guides(fill='none')+
  coord_flip()+
  theme_bw()

Correlation between emotion expressed and review rating, by sentiment

Let us quantify the relationship by examining the correlation between frequency of emotions expressed and rating, by sentiment. Can you explain these results?

videogame%>%
  unnest_tokens(output = word, input = review)%>%
  inner_join(nrc)%>%
  group_by(id,sentiment,review_rating)%>%
  count()%>%
  pivot_wider(names_from = sentiment,values_from = n)%>%
  mutate_at(.vars = 3:12, .funs = function(x) replace_na(x,0))%>%
  pivot_longer(cols = 3:12, names_to = 'sentiment',values_to = 'n')%>%
  ungroup()%>%
  group_by(sentiment)%>%
  summarize(r = cor(n,review_rating))%>% 
  ungroup()

Sentiment Score Lexicons

afinn Sentiment

The bing and nrc emotion lexicons classify a word based on the presence or absence of an emotion or valence. The afinn lexicon scores each word based on the extent to which it is positive or negative. For instance, the afinn lexicon will make a distinction between words “satisfied” and “delighted” based on how positive they are, but the bing lexicon will simply categorize both as being positive.

get_sentiments('bing')[get_sentiments('bing')$word %in% c('satisfied','delighted'), ]

The afinn lexicon assigns a numeric value to the emotion. Thus this lexicon can be thought of as computing a sentiment score for words.

get_sentiments('afinn')[get_sentiments('afinn')$word %in% c('satisfied','delighted'),]

This lexicon, which was previously a part of library(tidytext), was dropped from the package as of June 14, 2019 and re-integrated later in 2019. To access afinn from tidytext, run get_sentiments('afinn') in interactive mode (i.e., console or script, not markdown) and agree to the non-commercial use agreement. This should only be required on first use. An alternative is to use the copy of the lexicon that I have taken from its non-commercial use link and posted to github. This lexicon and a number of others can be found here but its free use is limited to non-commercial purposes. The following code will place the lexicon in a dataframe called afinn.

Method 1: Get Lexicon from tidytext

afinn = get_sentiments('afinn')

Method 2: Get Lexicon from github

afinn = read.table('https://raw.githubusercontent.com/pseudorational/data/master/AFINN-111.txt',
                   header = F,
                   quote="",
                   sep = '\t',
                   col.names = c('word','value'), 
                   encoding='UTF-8',
                   stringsAsFactors = F)
afinn[1:50,]

Let us examine the scores used to represent these emotions.

afinn %>%
  group_by(value)%>%
  count()%>% 
  ungroup()

Review 617

Now, let’s use this dictionary to determine the sentiment of review 617.

videogame %>%
  select(id,review)%>%
  unnest_tokens(output=word,input=review)%>%
  inner_join(afinn)%>%
  filter(id==617)%>%
  summarize(reviewSentiment = mean(value))

Here is the review

videogame$review[videogame$id==617]
## [1] "Like scottrocket3 said, you're out of your damn mind if you buy this for $200.  This is the best Mario game that ever came out.  The other is Super Mario World for Super Nintendo, also available for Nintendo DS.  I can't say enough about this great game!!  I LOVED the Super Mario Brothers Super Show featuring wrestling great Captain Lou Albano who, unfortunately passed away recently.  He was a Christian, so I will see him again.  Don't know about Danny Wells who did Luigi.  This is worth every penny you spend on it.  Unless of course you spend $100+dollars on this.  Mario first got me hooked on mushrooms.  Since then, I eat them by the truckload!!  They're good for you & have vitamin D, the sunshine vitamin."
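
To make the scoring rule explicit, here is a minimal base R sketch with a hypothetical score lexicon in the same shape as afinn (columns word and value). The scores and the sentence are made up for illustration. The review score is the mean value over matched words only; words absent from the lexicon play no role.

```r
# Hypothetical score lexicon; the values are made up, not looked up in afinn
toy_scores = data.frame(word = c('best', 'great', 'damn'),
                        value = c(3, 3, -2),
                        stringsAsFactors = FALSE)

# Hypothetical review text, lowercased and split on non-letters
text = 'This is the best Mario game. Great fun, damn hard.'
words = strsplit(tolower(text), '[^a-z]+')[[1]]

# merge() keeps only the words found in the lexicon, like inner_join
matched = merge(data.frame(word = words, stringsAsFactors = FALSE),
                toy_scores, by = 'word')
review_sentiment = mean(matched$value)
review_sentiment   # (3 + 3 - 2) / 3
```

Because unmatched words are ignored, a long review with a single scored word gets that word's score as its overall sentiment.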

All Reviews

videogame %>%
  select(id,review)%>%
  unnest_tokens(output=word,input=review)%>%
  inner_join(afinn)%>%
  group_by(id)%>%
  summarize(reviewSentiment = mean(value))%>%
  summarize(min=min(reviewSentiment),max=max(reviewSentiment),
            median=median(reviewSentiment),mean=mean(reviewSentiment))%>% 
  ungroup()
videogame %>%
  select(id,review)%>%
  unnest_tokens(output=word,input=review)%>%
  inner_join(afinn)%>%
  group_by(id)%>%
  summarize(reviewSentiment = mean(value))%>%
  ungroup()%>% 
  ggplot(aes(x=reviewSentiment,fill=reviewSentiment>0))+
  geom_histogram(binwidth = 0.1)+
  scale_x_continuous(breaks=seq(-5,5,1))+
  scale_fill_manual(values=c('tomato','seagreen'))+
  guides(fill='none')+
  theme_wsj()

Jockers Lexicon

Here is another sentiment lexicon. Scores for words range from -1 to +1.

library(lexicon)
head(key_sentiment_jockers)

Let us examine review sentiment based on this lexicon.

videogame %>%
  select(id,review)%>%
  unnest_tokens(output=word,input=review)%>%
  inner_join(key_sentiment_jockers)%>%
  group_by(id)%>%
  summarize(reviewSentiment = mean(value))%>%
  summarize(min=min(reviewSentiment),max=max(reviewSentiment),
            median=median(reviewSentiment),mean=mean(reviewSentiment))%>% 
  ungroup()

Visualizing Sentiment scores

videogame %>%
  select(id,review)%>%
  group_by(id)%>%
  unnest_tokens(output=word,input=review)%>%
  inner_join(key_sentiment_jockers)%>%
  summarize(reviewSentiment = mean(value))%>%
  ungroup()%>% 
  ggplot(aes(x=reviewSentiment,fill=reviewSentiment>0))+
  geom_histogram(binwidth = 0.02)+
  scale_x_continuous(breaks=seq(-1,1,0.2))+
  scale_fill_manual(values=c('tomato','seagreen'))+
  guides(fill='none')+
  theme_wsj()

Senticnet Lexicon

Here is yet another sentiment score lexicon with scores ranging from -1 to +1.

library(lexicon)
head(hash_sentiment_senticnet)

Let us examine review sentiment based on this lexicon.

videogame %>%
  select(id,review)%>%
  unnest_tokens(output=word,input=review)%>%
  inner_join(hash_sentiment_senticnet, by = c('word'='x'))%>%
  group_by(id)%>%
  summarize(reviewSentiment = mean(y))%>%
  summarize(min=min(reviewSentiment),max=max(reviewSentiment),
            median=median(reviewSentiment),mean=mean(reviewSentiment)) %>% 
  ungroup()

Visualizing Sentiment scores

videogame %>%
  select(id,review)%>%
  group_by(id)%>%
  unnest_tokens(output=word,input=review)%>%
  inner_join(hash_sentiment_senticnet, by = c('word'='x'))%>%
  summarize(reviewSentiment = mean(y))%>%
  ungroup()%>%
  ggplot(aes(x=reviewSentiment,fill=reviewSentiment>0))+
  geom_histogram(binwidth = 0.01)+
  scale_x_continuous(breaks=seq(-1,1,0.2))+
  scale_fill_manual(values=c('tomato','seagreen'))+
  guides(fill='none')+
  theme_wsj()

Other Lexicons

There are of course many other lexicons.

head(hash_sentiment_sentiword)
head(hash_sentiment_socal_google)
head(hash_sentiment_slangsd)

Profanity

In some scenarios, one may want to filter out certain words, for instance when they add no information or when they are offensive.

Here are a few lexicons from library(lexicon) containing offensive words that one may want to filter out of a dataset.

  • profanity_alvarez
  • profanity_arr_bad
  • profanity_racist
  • profanity_zac_anger

To illustrate, here we remove offensive words from the videogame reviews using one of these lexicons, profanity_racist.

videogame %>%
  unnest_tokens(output = word, input = review)%>%
  select(id, word)%>%
  anti_join(y = data.frame(word = profanity_racist),
            by=c('word'='word'))
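
If more than one word list is needed, the vectors can be combined before the anti-join. The sketch below uses two hypothetical placeholder vectors (list_a and list_b) rather than the real lexicons, so the words are mild stand-ins. The same pattern should apply to any pair of character vectors.

```r
library(dplyr)

# Hypothetical stand-ins for two word vectors (placeholders, not real lexicon entries)
list_a = c('darn', 'heck')
list_b = c('heck', 'blast')

# Combine the vectors and drop duplicates
blocklist = data.frame(word = unique(c(list_a, list_b)), stringsAsFactors = FALSE)

# Hypothetical tokens, one row per word
tokens = data.frame(id = c(1, 1, 1, 2, 2),
                    word = c('darn', 'good', 'game', 'blast', 'it'),
                    stringsAsFactors = FALSE)

# Keep only tokens that do not appear in the combined list
clean = tokens %>%
  anti_join(blocklist, by = 'word')
clean
```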

Visualizing Text

Wordcloud

In general, wordclouds offer little insight into the data, yet they tend to be very good at capturing the interest of non-technical audiences. Let us begin by creating a wordcloud from our data using functions from library(tidytext), library(dplyr), library(tidyr), and library(wordcloud).

wordcloudData = 
  videogame%>%
  unnest_tokens(output=word,input=review)%>%
  anti_join(stop_words)%>%
  group_by(word)%>%
  summarize(freq = n())%>%
  arrange(desc(freq))%>%
  ungroup()%>%
  data.frame()

library(wordcloud)
set.seed(617)
wordcloud(words = wordcloudData$word,wordcloudData$freq,scale=c(2,0.5),
          max.words = 100,colors=brewer.pal(9,"Spectral"))

Comparison Cloud

Finally, here is a comparison cloud to contrast positive and negative words in the reviews.

library(tidyr)
wordcloudData = 
  videogame%>%
  unnest_tokens(output=word,input=review)%>%
  anti_join(stop_words)%>%
  inner_join(get_sentiments('bing'))%>%
  count(sentiment,word,sort=T)%>%
  spread(key=sentiment,value = n,fill=0)%>%
  data.frame()

rownames(wordcloudData) = wordcloudData[,'word']
wordcloudData = wordcloudData[,c('positive','negative')]

set.seed(617)
comparison.cloud(term.matrix = wordcloudData,scale = c(2,0.5),max.words = 200, rot.per=0)


This file was generated using R Version 4.1.2